This paper presents the architecture of a special-purpose computer 
for logic simulation using distributed processing. The architecture is 
based on the utilization of inexpensive microprocessors interconnected 
by a communication structure. The communication structure 
is cross-point based for simple evaluations and time-shared parallel 
bus based for functional evaluations. Analysis is carried out to show 
that the performance of the proposed simulator is better by over two 
orders of magnitude than traditional logic simulation carried out on 
a general-purpose computer. Also, the power of the simulator is 
proportional to the number of slave processors over a certain range. 

I. INTRODUCTION 

The logic circuit simulator is an important component of a CAD 
system. It is used to predict logic circuit operation and performance 
under normal and faulty conditions. The application of the logic circuit 
simulator can be divided into two major areas: verification of new logic 
hardware designs and fault analysis of these designs. 

As an evaluation tool, it can be used to verify the logical correctness 
of new hardware designs. Other information that can be obtained using 
a logic simulator includes timing and signal propagation characteristics, 
and race and oscillatory circuit conditions. If the results of simulation 
indicate an unsatisfactory design, i.e., the circuit does not perform as 
expected, then changes can be made to the design. The design can be 
reevaluated using the simulator. After a number of iterations, a satis- 
factory design ready for committing to hardware should result. A logic 
circuit simulator used as above is known as a true value simulator. 



* This paper is based on material to be submitted by S. H. Patel in partial fulfillment 
Note that the results of true value simulation can be used for diagnosing 
faulty equipment using the guided probe technique. 1 

Most logic circuit simulators also have fault-simulation capability. 
This capability can be used to determine fault coverage by a proposed 
test sequence, production of a fault dictionary, evaluation of test-point 
effectiveness, and evaluation of self-checking circuitry. The behavior 
of a circuit under fault conditions can also be investigated. Fault 
analysis can be used to investigate initialization and fault-induced 
races, and to perform timing analysis under specific fault conditions. 
For fail-safe circuitry, selected faults can be inserted in the circuit and 
the effect of these observed on the outputs by simulation. If a forbidden 
output is obtained under some fault, then the circuit design must be 
changed. 

Figure 1 shows the general environment of a logic circuit simulator. 
The circuit to be analyzed is modeled using a circuit-description 
language. This language describes the connectivity and behavior of the 
circuit. The modeling information typically includes element type 
(gate or functional), associated delays, and interconnection data. Once 
the data structure of the model is set up in the logic circuit simulator, 
simulated inputs are applied either dynamically (at prescribed times) 
or statically (after the circuit is stabilized). Fault simulation is 
performed using one of several methods: one fault at a time, parallel, 
concurrent, or deductive. 2-4 The simulated output is recorded either in 
plot form (true value) or tabular form (true value and fault 
simulation). 

Currently, most digital circuits are simulated on large general-pur- 
pose computers. This method of simulation is complex and expensive 
to operate and maintain. 5 There is a need for more sophisticated and 
cost-effective simulators as we get into the VLSI era. Very large 
simulation time and costs will result when dealing with circuits of 
VLSI complexity (more than 100,000 gates on a single chip). 

Fig. 1. Operating environment of a logic circuit simulator. 

II. SPECIAL-PURPOSE SIMULATION HARDWARE 

Existing logic circuit simulators are implemented in software that is 
executed on a general-purpose computer. To date, a large amount of 
work has been invested in optimizing this software. For further im- 
provements in the performance of the logic circuit simulator, the 
hardware on which the simulation software executes must be opti- 
mized. With the advent of low-cost microcomputers, the development 
of special-purpose logic simulation hardware becomes attractive. Pos- 
sible benefits are higher speeds, lower costs, and greater flexibility (for 
example, better integration in a test station). 

Recently there has been some interest in developing special-purpose 
logic simulators. Barto and Szygenda have developed special-purpose 
simulation hardware based on distributed processing. 5 More recently, 
Abramovici et al. have presented a special-purpose architecture based 
on pipelining and concurrency. 6 Both approaches use dedicated 
processors for performing specific tasks. A special-purpose logic simulation 
machine using parallel processing (the Yorktown Simulation Engine) 
has been built by IBM. 7-9 

Concurrency allows simultaneous processing of several parallel 
events leading to a reduction in overall processing time. There are at 
least two types of concurrencies present in the simulation of logic 
circuits. One type of concurrency occurs in the simulation algorithm 
and the other in the actual simulated hardware. 

The first type of concurrency can be called algorithm concurrency. 
In logic circuit simulation a number of operations have to be performed 
during a simulated time interval. Simulated time consists of discrete 
points in time (approximated to the nearest integer) at which changes 
in logic values on signal lines can occur. A simulated time interval is 
the time between two such consecutive discrete points. A simulated 
time interval is also sometimes referred to as a time frame. A simula- 
tion cycle time is the time required to carry out the processing during 
a simulated time interval. Typical operations carried out during a 
simulation cycle include determining current events, updating values 
at source, determining fanout, updating values at fanout, evaluating 
elements, and scheduling resulting events. An event is a change in logic 
value on a signal line. Scheduling an event is marking it to occur at 
some time in the future. Consider several elements being evaluated 
and several resulting events being scheduled during a simulation cycle. 
In traditional simulation, the elements are evaluated and the resulting 
events scheduled sequentially. No two operations are performed si- 
multaneously. One can take advantage of the inherent concurrency by 
noting that after an element has been evaluated and while the resulting 
event is being scheduled, the evaluation of another element can be 
concurrently started. An average of 80 such concurrent events 
was observed per simulated time interval during the simulation of a 
4000-gate circuit. 6 This concurrency appears in the simulation 
algorithm. The architecture proposed by Abramovici et al. takes advantage 
of this concurrency to some extent. The main ingredient of this solution 
is functional partitioning of the tasks to several microprocessors. 
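
For reference, the operations of a traditional, strictly sequential simulation cycle can be sketched as follows. This is an illustrative sketch only, not code from the paper: the netlist format, the gate library `GATES`, and the function `simulate` are assumptions, and every element is assumed to have a delay of at least one time unit.

```python
from collections import defaultdict

# Hypothetical gate library: gate type -> function of the input values.
GATES = {
    "AND": lambda ins: int(all(ins)),
    "OR": lambda ins: int(any(ins)),
    "NOT": lambda ins: int(not ins[0]),
}


def simulate(netlist, values, events, end_time):
    """Sequential event-driven simulation (assumes every delay >= 1).

    netlist: output line -> (gate type, input lines, delay)
    values:  line -> current logic value (updated in place)
    events:  time -> list of (line, new value) scheduled changes
    Returns the (time, line, value) changes that actually occurred.
    """
    fanout = defaultdict(list)
    for out, (_, ins, _) in netlist.items():
        for line in ins:
            fanout[line].append(out)

    schedule = defaultdict(list, events)
    trace = []
    for t in range(end_time + 1):
        # Determine current events and update values at the source.
        active = set()
        for line, value in schedule.pop(t, []):
            if values[line] != value:
                values[line] = value
                trace.append((t, line, value))
                # Determine fanout: these elements must be evaluated.
                active.update(fanout[line])
        # Evaluate elements one at a time and schedule resulting events.
        for out in sorted(active):
            kind, ins, delay = netlist[out]
            new = GATES[kind]([values[i] for i in ins])
            if new != values[out]:
                schedule[t + delay].append((out, new))
    return trace
```

The inner evaluation loop is the part that the architectures discussed here try to overlap or distribute: no two elements are evaluated at the same time in this form.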

The concurrency can also be viewed from the point of view of the 
logic circuit. Concurrent events occur during a simulation cycle because 
of the way electrical signals propagate in the logic circuit. Several 
elements may be activated at the same time because signal propagation 
occurs simultaneously along several paths in the actual hardware. If 
the elements that become active at the same time are processed by 
different processors simultaneously, then the overall simulation time 
will be reduced. This type of concurrency can be called logic circuit 
concurrency. Its main ingredient is distributed processing among 
several processors, all of which are executing the same algorithm. 

This paper describes the architecture of a special-purpose logic 
simulation machine designed to take advantage of the parallelism 
caused by concurrent activity of signals in a circuit. The system is 
essentially a processing network based on an interconnection of low- 
cost microcomputers. The circuit to be simulated is partitioned into 
subcircuits and each subcircuit is simulated in a separate microcom- 
puter. Thus, several microcomputers can be simultaneously simulating 
several elements activated by parallel signals. This simulator is differ- 
ent from those proposed in the literature (see Refs. 5 and 6) in that the 
multiple processors do not perform dedicated tasks. Also, the modu- 
larity of the simulator proposed in this paper allows easy increase of 
computational power. 

III. MULTIPROCESSOR OPERATION ENVIRONMENT 

The operation environment of the multiprocessor digital logic sim- 
ulator is shown in Fig. 2. The general-purpose computer acts as a 
preprocessor at the beginning of simulation and as a postprocessor at 
the end of simulation. 

At the beginning of simulation, the circuit to be simulated is modeled 
on the general-purpose computer. The data structure is then loaded 
into the multiprocessor simulator. The loading problem is not dis- 
cussed in this paper. After setting up the environment for the multi- 
processor simulator, the general-purpose computer requests the sim- 
ulator to start. 

The simulation is carried out in the multiprocessor simulator. The 
simulator can be programmed to output intermediate results automat- 
ically to the general-purpose computer. The simulator can also be 
interrupted by the general-purpose computer for intermediate results. 
The user can ask for information about a simulation run while it is in 
progress (e.g., the status of a variable) and make certain run-time 
decisions such as continue simulation, apply extra input vectors, or 
stop. At the end of circuit simulation, the final simulation results and 
any other user-requested information are sent to the general-purpose 
computer. User-requested information typically includes output values 
of elements (monitored points) at specific simulated times or under 
some other specified conditions. The general-purpose computer 
formats this information for suitable presentation to the user. 

Fig. 2. Preprocessing and postprocessing for the multiprocessor-based 
special-purpose logic simulator. 

IV. ARCHITECTURE DESCRIPTION 

The multiprocessor simulator consists of processors p_1 through p_n. 
The circuit to be simulated is partitioned into blocks a_1 through a_n. 
The signal connections between two blocks a_i and a_j are designated as 
b_ij. Each block a_x is then mapped into processor p_x as a subcircuit c_x. 
Figure 3 shows two blocks a_i and a_j mapped to processors p_i and p_j, 
respectively, as subcircuits c_i and c_j. The blocks are not necessarily 
clusters. That is, elements in a block can be from disjoint portions of 
the circuit. The signal connections b_ij between blocks a_i and a_j are 
mapped into a data path d_ij between processors p_i and p_j. 

During simulation the subcircuits c_i and c_j are simulated 
independently. Different subcircuits become active as signal values proceed 
from the primary inputs to primary outputs. As simulation progresses, 
data will have to be carried between subcircuits c_i and c_j as the logic 
values on the signal connections between the two subcircuits change. 
This data is transported across the data path d_ij. Typical data sent 
across the data path consists of changed logic values. 
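
The mapping from a partitioned circuit to the connection sets b_ij can be illustrated with a short sketch. The function name `inter_block_connections`, the fanout-dictionary representation, and the example element names are assumptions made for illustration; the paper does not prescribe a data format.

```python
from collections import defaultdict


def inter_block_connections(fanout, block_of):
    """Return b[(i, j)]: the signal connections (driver, load) that leave
    block i and enter block j, for i != j.

    fanout:   element -> list of elements it drives
    block_of: element -> index of the block it was assigned to
    """
    b = defaultdict(set)
    for driver, loads in fanout.items():
        for load in loads:
            i, j = block_of[driver], block_of[load]
            if i != j:
                # This connection must be carried by data path d_ij.
                b[(i, j)].add((driver, load))
    return dict(b)
```

Connections whose driver and load fall in the same block do not appear in any b_ij and therefore generate no traffic on the communication structure.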

The architecture of the multiprocessor simulator proposed in this 
paper is shown in Fig. 4. The simulator consists of a communication 
structure (communication medium and its associated control) 
connected to a master, several simple evaluators for simulating gate level 
blocks, and several functional evaluators for simulating functional 
blocks. A cross-point matrix is used to interconnect the master and 
the simple evaluators. The functional evaluators are connected to the 
cross-point matrix through a bus interface unit and a parallel bus. It is 
shown in Section VI that the speed of a cross-point matrix is required 
for transferring data between the simple evaluators. A parallel bus 
provides sufficient speed for functional evaluators. Note that if only 
simple evaluations are to be carried out, then the bus interface unit, 
the parallel bus, and the functional evaluators can be removed. If only 
functional evaluations are to be performed, then the bus interface unit, 
the cross-point matrix, and the simple evaluators can be removed. 

[Block diagram: communication structure (crossbar matrix and bus interface unit), master, simple evaluators 1 through n, parallel bus, and functional evaluators 1 through n.] 

Fig. 4. Architecture of proposed multiprocessor simulator. 

V. MULTIPROCESSOR IMPLEMENTATION 

The configuration of the multiprocessor-based digital logic simulator 
is shown in Fig. 5. The simulator consists of one master and a 
multiplicity of slaves interconnected by a communication structure 
(communication medium and its associated control). The implemen- 
tation of the simulator allows the use of either a shared or a dedicated 
communication structure. The master has local memory for its use. 
Each slave consists of a processing unit (PU) with its associated 
memory. The master and the slaves also each have two FIFO buffers 
and two data sequencers for interfacing to the communication struc- 
ture (see Sections 5.1 and 5.2). 

The processors p_i and p_j shown in Fig. 3 correspond to the slaves s_i 
and s_j in the multiprocessor simulator. The subcircuits c_i and c_j reside 
in the memories of the slaves s_i and s_j. The interconnections b_ij between 
blocks a_i and a_j are mapped into the data path d_ij that constitutes part 
of the communication structure. 

At the beginning of each simulation cycle the master sends primary 
input values (if any) to the appropriate slaves using the communication 
structure. The master then issues a start signal to the slaves. This 
signal informs the slaves to start processing for the next simulation 
cycle. During the processing of a simulation cycle a slave unit may 
generate data for the other slaves or the master. The data is sent to 
the destination slave or the master using the communication structure. 
Data transferred between the slaves consists of scheduled events for 
the next time frame. A scheduled event is a change in logic value on a 
signal line scheduled to occur at some time in the future. Only data for 
the subsequent time interval is transferred between the slaves in order 
to reduce the amount of information sent over the communication 
structure, and thus the communication overhead. The scheduled time 
does not have to be sent. Data transferred from the slaves to the 
master consists of primary output values and user-requested 
information. 

Each slave informs the master when it has finished processing and 
transferring data for the current simulation cycle. When all slaves have 
informed the master about their completion of processing for the 
current simulated time interval, and also the master has finished 
transferring any primary input values scheduled for the next simulated 
time interval to the slaves, the master issues a start signal to the slaves 
for the next simulation cycle. 
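
The START/DONE handshake described above behaves like a pair of barriers. The sketch below models it with Python threads standing in for the slave processors; this is an assumption made purely for illustration (the paper uses hardware signal lines), and the name `run_cycles` is hypothetical.

```python
import threading


def run_cycles(num_slaves, cycles):
    """Run `cycles` lock-step simulation cycles with one thread per slave.

    Returns a log of (cycle, slave index) entries in the order processed.
    """
    start = threading.Barrier(num_slaves + 1)  # models the START signal
    done = threading.Barrier(num_slaves + 1)   # models the DONE line
    log, lock = [], threading.Lock()

    def slave(index):
        for cycle in range(cycles):
            start.wait()                   # wait for START from the master
            with lock:
                log.append((cycle, index))  # stands in for evaluation work
            done.wait()                    # report completion

    threads = [threading.Thread(target=slave, args=(i,))
               for i in range(num_slaves)]
    for thread in threads:
        thread.start()
    for _ in range(cycles):
        start.wait()  # master issues START once its own transfers are done
        done.wait()   # master proceeds only when all slaves are DONE
    for thread in threads:
        thread.join()
    return log
```

Because no slave can pass the START barrier for the next cycle until every slave has reached the DONE barrier, work from different simulation cycles never interleaves, although the order of slaves within a cycle is arbitrary.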

Fig. 5. Implementation of a multiprocessor-based digital logic simulator. 

The different sections of the multiprocessor are discussed in greater 
detail below. 

5.1 Slave unit 

The slave unit configuration is shown in Fig. 6. The PU is a general- 
purpose 16-bit microprocessor. The input and output data sequencers 
can be either specially designed logic circuits or commercially available 
single-chip microcomputers. The FIFO buffers are commercially avail- 
able devices. 

The slave unit PUs perform the actual element/function evaluation 
and event scheduling. As noted previously, the partitioning of the logic 
circuit to be simulated is done on a general-purpose computer. Each 
PU contains a block of the logic circuit to be simulated. The simulation 
algorithm resides in the PU memory; all slaves contain identical 
algorithms. 

Fig. 6. Slave unit configuration. 

Each slave uses an Output Data Sequencer (ODS) to transfer data 
out and an Input Data Sequencer (IDS) to receive data from the other 
slaves or the master via the communication structure. In a slave unit, 
the PU and the data sequencers are isolated from each other by means 
of FIFO buffers. Thus, the slaves can send and receive data independ- 
ently of whether the PU is active or not. 

The PU stores any data it has for other PUs or the master in the 
Output FIFO Buffer (OFB). The ODS makes a request for the com- 
munication structure if there is any data to transfer from the OFB. 
The ODS of the slave, if granted the use of the communication 
structure, takes data (scheduled values for slaves and primary output 
values and user-requested information for the master) from the OFB 
and sends it over the communication structure to other slaves or the 
master. The data is received by the IDS of the destination slave or the 
master (described in Section 5.2). Any data received by an IDS is put 
in its Input FIFO Buffer (IFB). End of Data (EOD) flags are used to 
separate data streams since a PU can be writing new data to the OFB 
before its ODS has finished transferring current data and similarly, an 
IDS can be receiving new data in the IFB before its PU has finished 
reading current data. 
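
The role of the EOD flag as a separator between data streams in a FIFO buffer can be sketched as follows. The sentinel object `EOD`, the function `read_stream`, and the use of a Python `deque` as the FIFO are assumptions for illustration; the actual buffers are hardware devices.

```python
from collections import deque

EOD = object()  # hypothetical End of Data flag


def read_stream(fifo):
    """Remove and return the data of the oldest complete stream.

    A stream is complete once its EOD flag has been written; entries
    written after that flag belong to the next stream and are left alone.
    """
    if EOD not in fifo:
        raise LookupError("current stream is not complete yet")
    data = []
    while (item := fifo.popleft()) is not EOD:
        data.append(item)
    return data
```

For example, if a PU has written two events, an EOD flag, and then one event for the following cycle, a reader obtains only the first two events and the third remains queued until its own EOD flag arrives.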

There are two signal lines between a slave unit and the master. The 
master signals the slaves using a START signal and the slaves signal 
the master using the DONE line. The DONE line will become active 
when all the slaves have finished processing. 






The START line from the master initiates the processing for the 
next PU processing cycle. This signal causes the IDS to load an EOD 
flag in the IFB of all slaves. At the beginning of each simulation cycle, 
the PU monitors the IFB for data from other slaves and the master for 
the current simulation cycle. The EOD flag marks the end of data 
from other slaves and the master for the current simulation cycle. 
When the PU reads this flag, it starts evaluation for the current 
simulation cycle. The START signal from the master also informs the 
slave ODS to start sending out any data to be used during the next 
simulated time cycle from the OFB. 

At the end of the simulation cycle the PU loads an EOD flag in the 
OFB and starts preprocessing for the next simulation cycle. When the 
ODS encounters the EOD flag in the OFB, it has finished transferring 
data for the current simulation cycle. The ODS informs the master 
using the DONE line. 

5.1.1 PU operation 

The following data tables are used by each PU for its operation (see 
also Fig. 7): 

(i) Circuit Description Table. This table contains interconnection 
data for the subcircuit. For each element it contains the value, type, 
delay, input status word pointer and corresponding status fields, inter- 
nal fanout list pointer and corresponding fanout lists, and external 
fanout list pointer and corresponding fanout lists. The input status 
word pointer and the corresponding status fields give the signal values 
on the fanin lines. The internal fanout list pointer and corresponding 
fanout lists give the fanouts that remain in the subcircuit. The 
external fanout list pointer and corresponding fanout lists give the fanouts 
that go to subcircuits located in other slaves. An element may have 
only internal fanout, only external fanout, or both internal and external 
fanout. Note that storing the external fanout takes up more space than 
storing an internal fanout since both the destination processor address 
and element index have to be stored for the external fanout. 

(ii) Activity List. This list is used to keep track of active elements 
during a simulation time interval. These elements are to be evaluated. 

(iii) Timing Wheel. This data area contains the events that are 
scheduled in the future. A large amount of work has been done in this 
area. 10,11 
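
The tables listed above might be represented as in the following sketch. The class and field names (`Element`, `TimingWheel`, `internal_fanout`, and so on) are hypothetical, and the timing wheel is shown in its simplest circular-array form, assuming all delays are smaller than the wheel size.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One entry of a (hypothetical) circuit description table."""
    kind: str                 # element type, e.g. "AND"
    delay: int                # delay in simulated time units
    value: int = 0            # current output value
    input_status: list = field(default_factory=list)     # values on fanin lines
    internal_fanout: list = field(default_factory=list)  # element indices
    # External fanout needs both a slave address and an element index.
    external_fanout: list = field(default_factory=list)  # (slave, index) pairs


class TimingWheel:
    """Circular array of event lists; slot `current` holds list L_t."""

    def __init__(self, size):
        self.slots = [[] for _ in range(size)]
        self.current = 0

    def schedule(self, delay, event):
        if not 0 < delay < len(self.slots):
            raise ValueError("delay must be between 1 and size - 1")
        self.slots[(self.current + delay) % len(self.slots)].append(event)

    def take_current(self):
        """Return and clear the events scheduled for the current time."""
        events, self.slots[self.current] = self.slots[self.current], []
        return events

    def advance(self):
        self.current = (self.current + 1) % len(self.slots)
```

The extra slave address carried by each external fanout entry is what makes external fanout more expensive to store than internal fanout, as noted in item (i).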

The PU operation can be described in terms of two essentially 
concurrent processes, namely the simulation cycle (execution of sim- 
ulation algorithm) and the communication cycle (communication of 
events). During one simulation cycle the following operations occur in 
the given order: 

1. Update line values from current list L_t of timing wheel. 

2. Find fanout and put active elements in activity list. 

3. From next list L_{t+1} of timing wheel, for each entry that is an 
external fanout node, store scheduled event in OFB. Remove entry 
from timing wheel if it does not have any internal fanout also. 

4. Update line values from IFB until EOD flag is received. The 
EOD flag signifies end of data present in the IFB for the current 
simulation cycle. (The EOD flag is loaded by the IDS when it receives 
a START signal from the master.) 

Note: Any user requests received are stored for later processing. 

5. Find fanout and put active elements in activity list. 

6. Evaluate elements in activity list. 

7. For active elements whose output changes, if delay of element is 
one and it is an external output node, schedule change in OFB. 

8. If test in step 7 fails, schedule change on timing wheel. 

9. Gather any user-requested information and store it in OFB. 

10. Store EOD flag in OFB. 

11. Increment current list pointer of timing wheel to the list of next 
time. 

From steps 7 and 8 it can be seen that events on an external fanout 
are scheduled on the timing wheel of the processor in which the event 
occurred except in the case where the external fanout is going to be 
active in the next time interval. In this way the scheduled time does 
not have to be transmitted with the scheduled event, thus saving 
communication overhead. 
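
The eleven steps can be sketched in code as follows. This is an interpretation under stated assumptions, not the authors' implementation: the `PU` class, the dictionary layout of the elements, and the gate library are hypothetical; every delay is assumed to be at least one and smaller than the wheel size; each element keeps its last computed output in `out`; step 9 is omitted; and an element with delay one that has both internal and external fanout is assumed to be sent to the OFB and also placed on the timing wheel, a case the text does not spell out.

```python
from collections import deque

EOD = "EOD"  # End of Data flag
GATES = {"AND": lambda v: int(all(v)), "OR": lambda v: int(any(v)),
         "NOT": lambda v: int(not v[0])}


class PU:
    def __init__(self, elements, wheel_size=8):
        # elements: index -> dict(kind, delay, out, ins, internal, external)
        #   internal: [(element index, input pin)]
        #   external: [(slave address, element index, input pin)]
        self.elements = elements
        self.wheel = [[] for _ in range(wheel_size)]
        self.now = 0
        self.ifb, self.ofb = deque(), deque()

    def _apply(self, idx, pin, value, activity):
        if self.elements[idx]["ins"][pin] != value:
            self.elements[idx]["ins"][pin] = value
            activity.add(idx)

    def cycle(self):
        size, activity = len(self.wheel), set()
        # Steps 1-2: apply events of current list L_t to internal fanout.
        current, self.wheel[self.now] = self.wheel[self.now], []
        for src, value in current:
            for idx, pin in self.elements[src]["internal"]:
                self._apply(idx, pin, value, activity)
        # Step 3: forward external events of next list L_{t+1} to the OFB.
        nxt, kept = (self.now + 1) % size, []
        for src, value in self.wheel[nxt]:
            el = self.elements[src]
            for slave, idx, pin in el["external"]:
                self.ofb.append((slave, idx, pin, value))
            if el["internal"]:
                kept.append((src, value))
        self.wheel[nxt] = kept
        # Steps 4-5: apply received (index, pin, value) events until EOD.
        # A real PU would wait here; this sketch expects EOD to be present.
        while (item := self.ifb.popleft()) != EOD:
            self._apply(*item, activity)
        # Steps 6-8: evaluate active elements, schedule output changes.
        for src in sorted(activity):
            el = self.elements[src]
            new = GATES[el["kind"]](el["ins"])
            if new == el["out"]:
                continue
            el["out"] = new
            if el["delay"] == 1 and el["external"]:
                for slave, idx, pin in el["external"]:
                    self.ofb.append((slave, idx, pin, new))
                if not el["internal"]:
                    continue
            self.wheel[(self.now + el["delay"]) % size].append((src, new))
        # Step 10: close this cycle's output stream (step 9 omitted).
        self.ofb.append(EOD)
        # Step 11: advance the current list pointer.
        self.now = nxt
```

In this sketch an event leaves the PU exactly one cycle before it takes effect, either in step 3 (delay greater than one) or in step 7 (delay of one), so the OFB entries never need to carry a scheduled time.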

A communication cycle is the period in between two START com- 
mands issued from the master. This cycle is phased with respect to the 
simulation cycle, as shown in Fig. 8. Note that the end of a communi- 
cation cycle is an appropriate point for the multiprocessor simulator 
to stop for processing any interrupt requests from the general-purpose 
computer or sending out intermediate results. The master can issue 
the start for the next simulation cycle by sending a START command 
to the slave data sequencers after it satisfies all requests. 

All the slave unit PUs contain the same software. Note that this 
algorithm is similar to the one used in traditional logic simulators 
except that the operations are sequenced differently. 

5.1.2 Operation of slave data sequencers 

The function of the data sequencers (IDS and ODS) is to transfer 
data from the OFB to the communication structure and from the 
communication structure to the IFB. 

The IDS and the ODS have some local memory. The IDS uses this 
memory to store any data it receives from the outside in case the IFB 
overflows. The ODS requires this memory only in certain 
circumstances. A case for its use is given in Section 5.3.2. 

The communication protocol between the data sequencers and the 
communication structure depends on the type of communication struc- 
ture. This is discussed in detail in Section 5.3. 

5.2 Master processor 

The master processor is the interface between the general-purpose 
computer and the simulator. Its main functions are to keep track of 
simulated time, keep the slaves in synchronism, supply the slaves with 
primary input values, and gather the primary output values from the 
slaves. It also stores any user-requested monitored point values sent 
to it by the slaves. 

The configuration of the master is similar to that of a slave unit and 
is shown in Fig. 9. It consists of a central processing unit (CPU) with 
some local memory, an input FIFO buffer (IFB), an output FIFO 
buffer (OFB), an input data sequencer (IDS), and an output data 
sequencer (ODS). The master is connected to the slave units through 
the communication structure. The master initiates processing for the 
next simulation cycle by issuing a START command on its signal line. 
When the slaves finish processing for the current simulated time 
interval, they inform the master through the DONE signal. 

Fig. 9. Master configuration. 

The master sends and receives data to and from the slaves using the 
communication structure. The interface between the master and the 
communication structure is similar to that between the slaves and the 
communication structure. The master stores any data (primary input 
values or user requests for monitored point values) it has for the slaves 
for the next simulation cycle in its OFB. The START signal which goes 
to the slaves also informs the ODS of the master to start transferring 
out data. The ODS encounters an EOD flag in the OFB when it has 
transferred all the data from the OFB. The ODS informs the master 
that it has finished sending out data by setting the DONE line. The 
IDS receives data from the slaves and puts it in the IFB. 

Figure 10 shows the master, slave unit PU, slave unit ODS, and 
slave unit IDS operations during a simulation cycle. 

5.3 Communication structure 

The communication structure is used as a medium for transferring 
data between the slaves and between the slaves and the master. Either 
a shared or a dedicated structure can be used for the multiprocessor 
simulator. Two types of communication structures will be considered 
here, namely the time-shared parallel bus and the cross-point matrix. 
Each case is treated separately below. The criteria for selecting the 
type of communication structure are given in Sections VI and VII. 



[Timing diagram: slave ODS, slave IDS, and slave PU activity (START, transfer data, receive data, EOD flags in the OFB and IFB, DONE) over one simulation cycle.] 

Fig. 10. Master and slave unit operations during a simulated time interval. 



2886 THE BELL SYSTEM TECHNICAL JOURNAL, DECEMBER 1982 



5.3.1 Time-shared parallel bus 

The interface between the data sequencers and a time-shared 
parallel bus based communication structure is shown in Fig. 11. When the 
ODS has some data to send, it sets the Request To Send (RTS) line 
high. The bus control grants it the use of the communication medium 
by sending a pulse on the Bus Grant line. The ODS sends out all the 
data present in its OFB. The data received by the IDS of the desti- 
nation unit is put in its IFB. The ODS then sets the RTS line low. 
This releases the bus, which is then granted by the bus control to 
another requesting slave or the master. All units have equal priority. 
The ODS will set the RTS line high again if it gets more data to 
transfer in the OFB. 
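
The request and grant sequence on the time-shared bus can be sketched as below. The text states only that all units have equal priority; the round-robin grant order used here is an assumption, as are the function name `run_bus` and the representation of each OFB as a `deque` of (destination, payload) pairs.

```python
from collections import deque


def run_bus(ofbs):
    """Serve all pending OFB data over a single shared bus.

    ofbs: unit name -> deque of (destination, payload) entries
    Returns the transfers in bus order as (sender, destination, payload).
    """
    order = list(ofbs)
    transfers = []
    start = 0
    while any(ofbs.values()):
        # A unit with data in its OFB holds its RTS line high.
        for k in range(len(order)):
            unit = order[(start + k) % len(order)]
            if ofbs[unit]:
                break
        # Bus Grant: the unit sends all data in its OFB, then lowers RTS.
        while ofbs[unit]:
            dest, payload = ofbs[unit].popleft()
            transfers.append((unit, dest, payload))
        # Assumed equal-priority policy: continue with the next unit.
        start = (order.index(unit) + 1) % len(order)
    return transfers
```

Only one unit transmits at a time, so the total transfer time grows with the sum of all OFB contents, which is why this structure suits the less frequent traffic of functional evaluators.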

The data sent out to a slave unit from another slave unit or the 
master consists of a scheduled event for the next simulation cycle. The 
data sent to the master consists of the address of the sending slave, 
element number (primary output or monitored point) and element 
value. A separate line Request to Send to Master (RTSM) is used to 
address the master. When the destination is the master, the address 
lines from the ODS contain the sending slave unit address. This 
address together with the element number and element value is stored 
in the master IFB by the master IDS. 

5.3.2 Cross-point matrix 

The interface between the data sequencers and a communication 
structure based on a cross-point matrix is shown in Fig. 12. When the 
ODS has some data to send, it puts the address of the destination unit 
on the address bus and makes a request to transfer data by sending a 
pulse on the RTS line. If the destination is not busy, the control for 
the matrix grants the transfer request. The ODS sends out data serially 
over the data line. The data received by the IDS of the destination 
unit is put in its IFB. The Data Ready line connected to the IDS is 
used to show the presence of data. It might happen that the destination 
is busy when an ODS makes an RTS to the cross-point matrix. In this 
case, the ODS gets a busy signal from the control of the cross-point 
matrix. In response to the busy signal, the ODS stores away the data 
that was to be transferred in its local memory and makes an RTS for 
the next set of data to be transferred from the OFB. A retry to send the 
blocked data is made later. 
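
The busy and retry behavior can be illustrated with a slot-based sketch. The slot model (each transfer occupies one time slot, and a destination accepts one transfer per slot), the function name `crosspoint_slots`, and the policy of retrying blocked data before taking new data from the OFB are all assumptions; the paper says only that a retry is made later.

```python
from collections import deque


def crosspoint_slots(ofbs):
    """Schedule transfers through a cross-point matrix, slot by slot.

    ofbs: sender -> deque of (destination, payload) entries
    Returns a list of slots; each slot lists the parallel transfers
    made in it as (sender, destination, payload).
    """
    local = {sender: deque() for sender in ofbs}  # ODS local memory
    slots = []
    while any(ofbs.values()) or any(local.values()):
        busy, slot = set(), []
        for sender, ofb in ofbs.items():
            sent = False
            # Retry previously blocked data first.
            for item in list(local[sender]):
                if item[0] not in busy:
                    local[sender].remove(item)
                    busy.add(item[0])
                    slot.append((sender, *item))
                    sent = True
                    break
            # Otherwise work through the OFB, setting blocked data aside.
            while not sent and ofb:
                item = ofb.popleft()
                if item[0] in busy:
                    local[sender].append(item)  # busy signal received
                else:
                    busy.add(item[0])
                    slot.append((sender, *item))
                    sent = True
        slots.append(slot)
    return slots
```

Unlike the shared bus, several sender and destination pairs can be served in the same slot, provided their destinations differ.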

As indicated in Section 5.3.1, the master requires the address of the 
sending slave unit. A slave unit ODS will recognize a request to transfer 
data to the master and it will transmit its slave unit address together 
with element number and value. 

The interface discussed above will apply for a nonblocking switching 
network also. However, this network will not be discussed here. 



VI. ARCHITECTURE EVALUATION 

In this section, various performance functions are derived for the 
multiprocessor architecture and compared with those for the tradi- 
tional logic simulator implemented on a general-purpose computer. 
The requirements for a circuit-partitioning algorithm are considered 
first. Based on these requirements, expressions and values for process- 
ing and communication times are derived next. Comparisons in terms 
of evaluations per second are then made between the multiprocessor 
simulator and the traditional logic simulator implemented on a general- 
purpose computer. 

Fig. 13. A logic circuit. 



6.1 Logic circuit partitioning 

The logic circuit to be simulated is partitioned into a number of 
subcircuits. Each subcircuit can essentially be regarded as an inde- 
pendent circuit. During simulation, data is passed between different 
subcircuits as signal values propagate from the primary inputs to the 
circuit outputs. The subcircuits will be stored in the appropriate 
memories of the slave unit PUs. (The element numbers in the original 
circuit are translated to a slave address and an index number.) Parti- 
tioning is a key to the operation and performance of the multiprocessor 
digital logic simulator. Partitioning must maximize multiprocessing 
while limiting communication. 

A partitioning algorithm is required to partition a logic circuit into 
subcircuits. The partitioning algorithm must produce subcircuits such 
that during logic circuit simulation, the number of simultaneously 
active subcircuits (processors) is maximum and the number of simul- 
taneously active elements in each subcircuit (processor) is minimum, 
while keeping the communication from being a bottleneck. Obviously, 
minimization of interprocessor communication and the proper choice 
of communication structure are necessary to avoid this bottleneck (see 
Sections VII and VIII). The fact that signals may propagate in parallel 
indicates that partitioning should be done along the depth of the 
circuit rather than the breadth of the circuit since this will tend to 
place concurrent activities in different blocks. One approach is to start 
with a primary input and trace a path towards a primary output 
forming an element string. An element string is a single fanout path. 
For example, elements (1, 6, 9, 11) and (5, 8, 10) in Fig. 13 constitute 
two element strings. Since two elements in a string will not normally be
active simultaneously, the whole string can be put in one subcircuit.
Each subcircuit corresponds to a block described in Section IV. Note
that if there is a fanout from any element to a zero delay element, then
the two elements must be in the same block. For a multiprocessor
system with n processors, a logic circuit must be partitioned into at
most n blocks. A partitioning algorithm based on this approach is
given in Appendix A.

Fig. 14: Multiprocessor simulator architecture.

6.2 Processing and communication 

Consider the architecture of a multiprocessor simulator shown in 
Fig. 14. Processors $p_1$ through $p_n$ are connected by a communication
structure. The logic circuit to be simulated is partitioned into n blocks 
using the partitioning algorithm described in Appendix A. Each block 
is then assigned to a processor. During simulation, the communication 
structure will be used to transfer data between the processors as the 
interconnections between different blocks become active. 

During a simulation cycle each of the processors will be evaluating 
active elements and scheduling events. Concurrently, data will be 
flowing across the communication structure as events in one processor 
activate elements in another processor. We define $t_p$ as the average
processing time per processor during a simulation cycle. Time $t_p$
represents the processing of all active elements and scheduling of
resulting events in one processor during a simulation cycle. Define $t_c$
as the total communication time during one simulation cycle. This
value represents the amount of time the communication structure is
busy (i.e., the time taken to service all requests to transfer data from
all processors) during a simulation cycle. Since the operations during
$t_p$ and $t_c$ occur concurrently, the length of the simulation cycle and,
hence, the number of evaluations per second will be determined by
the greater of $t_p$ and $t_c$.

Expressions and estimates for $t_p$ and $t_c$ for an optimum architecture
will be derived in the next two sections. The value of $t_c$ will be derived
for two cases, namely, a communication structure based on a time-shared
parallel bus and one based on a cross-point matrix.

6.3 Processing time $t_p$

Let N be the average number of active elements per simulation cycle
and n be the number of processors in the multiprocessor simulator.
Then N/n will be the average number of active elements per processor
during a simulation cycle for the ideal case. There will be some extra
active elements in a processor during a simulation cycle due to nonideal
partitioning. If k is the average unbalance factor, then the number
of active elements per processor is kN/n. Let a be the time required to
process one active element; then:

$$t_p = (N/n)\,k\,a, \quad 1 < n < N; \qquad k = 1 \ \text{for} \ n = 1.$$

This expression gives the average processing time per processor during
one simulation cycle.
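
The expression for the processing time, together with the rule from Section 6.2 that the longer of the processing and communication times sets the cycle length, can be sketched as follows. This is only an illustration under stated assumptions: the function names and the default k = 1.1 are chosen here and do not come from the paper, and the sketch does not enforce n < N.

```python
def processing_time_us(N, n, a_us, k=1.1):
    """Average processing time per processor per simulation cycle (microseconds).

    N    : average number of active elements per simulation cycle
    n    : number of processors
    a_us : time to process one active element, in microseconds
    k    : unbalance factor (assumed default; ignored when n == 1)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        k = 1.0  # the unbalance factor does not apply to a single processor
    return (N / n) * k * a_us


def cycle_length_us(t_p_us, t_c_us):
    """Processing and communication overlap, so the slower one sets the cycle."""
    return max(t_p_us, t_c_us)
```

For example, with N = 2,000, n = 100 and a = 26 μs, `processing_time_us` returns about 572 μs, which is 28.6(N/n) μs.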

The value of the processing time a is estimated next. Traditional
logic simulators can perform about 30,000 simple evaluations per
second (e.g., on an IBM 370/168). This represents 33 μs per element
evaluation. The simulation algorithms for the multiprocessor simulator and
the traditional logic simulator are somewhat similar. The element
evaluation time for the multiprocessor simulator can be written as
$a = 33u_1$ μs. The factor $u_1$ represents a slowdown factor due to the
difference in speed between a microprocessor and a general-purpose
computer. For an Intel* 8086 16-bit microprocessor the slowdown is
about 5.5 over an IBM 370/168 (Ref. 12). The element evaluation time for an
Intel 8086 becomes 181.5 μs. The operation of the microprocessor can
be sped up by using a microprogrammable processor and microprogramming
the simulation algorithm. For an Am29116 16-bit microprogrammable
microprocessor, a speed-up factor of 5 to 10 can be obtained
over the Intel 8086. Assuming an Am29116 and taking a speed-up
value of 7, the average time required to process a simple element will
be a = 26 μs. The value of a for functional elements will be 30 to 50
times larger.
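
The arithmetic behind these estimates can be retraced in a few lines. The variable names are illustrative only; the numbers are those quoted in the text, with the baseline rounded down to 33 μs as the text does.

```python
baseline_evals_per_second = 30_000   # traditional simulator on an IBM 370/168
slowdown_8086 = 5.5                  # Intel 8086 relative to the IBM 370/168
speedup_am29116 = 7                  # Am29116 relative to the 8086 (quoted range: 5 to 10)

# 1e6 / 30,000 = 33.3 microseconds, taken as 33 microseconds in the text
baseline_us = int(1e6 / baseline_evals_per_second)

a_8086_us = baseline_us * slowdown_8086        # 181.5 microseconds
a_am29116_us = a_8086_us / speedup_am29116     # about 25.9, quoted as 26 microseconds
```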

Until now it has been assumed that all the processors have the same
number of active elements. Assume that the unbalance due to nonideal
partitioning introduces 10 percent more active elements in a processor
(k = 1.1) and a is 40 times larger for functional evaluations than for
simple evaluations [$a_{(se)} = 26$ μs for simple evaluations and
$a_{(fe)} = 1040$ μs for functional evaluations]. The processing time for
simple evaluations becomes $t_{p(se)} = 28.6(N/n)$ μs. The processing time
for functional evaluations becomes $t_{p(fe)} = 1144(N/n)$ μs.

The processing time per active element ($t_p/N$) as a function of the
number of processors (n) in the multiprocessor simulator is plotted for
simple and functional evaluations in Fig. 15. If there is only one
processor, then the processing time per active element for one simulation
cycle is $t_{1(se)}/N = 26$ μs for simple evaluations and
$t_{1(fe)}/N = 1040$ μs for functional evaluations. Note that the unbalance
factor k does not apply when there is only one processor.

* Trademark of Intel Corporation.

Fig. 15: Processing time as a function of number of processors for functional and
simple evaluations.

The above analysis is done for $n \ll N$. For n comparable to or greater
than N there will be one active element per processor in the best case, and
the processing time per active element will remain constant at
$t_p/N = a/N$ for all values of n greater than N (see Fig. 15). Also, if
n > N and there is unbalance, then the value of k will be much larger.

6.4 Communication time $t_c$

The value of $t_c$ will depend on the type of communication structure.
Two types of communication structures will be considered here, a
time-shared parallel bus and a cross-point matrix.

To estimate $t_c$ for each case, first consider an element string yielded
by the partitioning algorithm discussed previously (see Fig. 16). If f is
the average fanout, one fanout line remains in the processor and f - 1
fanout lines go out to elements in other processors, except at the end of
the string, where all f fanout lines go to elements in other processors.
Let c be the average number of elements in one string. Normally, one
element per string is active during a simulation cycle. Define two
adjacent element strings as two strings with common interconnections.
Assume that all adjacent strings are in separate processors. This
gives the situation that requires the most communication and therefore the
fastest communication structure. The average number of communication
events generated by one active element during a simulation
cycle that have to be sent over the communication structure is:

$$e = [f + (c - 1)(f - 1)]/c.$$

The typical value of f can be taken as 2 and for large circuits c is
expected to be greater than 10. For f = 2 and c = 10, e will be equal to
1.1.
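
As a quick check of this expression, the following sketch evaluates e for a given fanout and string length. The function name is an assumption made for illustration.

```python
def events_per_active_element(f, c):
    """Average number of communication events per active element.

    f : average fanout
    c : average number of elements in an element string (c >= 1)

    The last element of a string sends all f fanouts off-processor; each of
    the other c - 1 elements keeps one fanout local and sends f - 1.
    """
    if c < 1:
        raise ValueError("a string holds at least one element")
    return (f + (c - 1) * (f - 1)) / c
```

With f = 2 and c = 10 this gives 1.1, and with c = 1 every fanout leaves the processor, so e equals f.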

6.4.1 Time-shared parallel bus 

Since the parallel bus is time-shared, the total communication time
will be given by the time required to transmit one event multiplied by
the number of communication events in a simulation cycle, i.e., Ne.
The time required to transmit one event will be the sum of bus request
and grant time ($t_{brg}$), address and data setup time ($t_{ds}$), data
acknowledge time ($t_{da}$), and bus release time ($t_{br}$). An expression
for the total communication time is:

$$t_{c(bus)} = (t_{brg} + t_{ds} + t_{da} + t_{br})\,N e.$$

The address and data setup time ($t_{ds}$) is composed of propagation
delay without capacitance ($t_{pd}$) and capacitance delay ($t_{cd}$). These two
parameters are functions of the bus length, which in turn will depend
on the number of processors. If d is the distance between two adjacent
processors, then the average distance an event has to be sent is (nd)/2.
Typical signal delay without capacitance is 1 ns/foot and if d is
assumed to be 0.5 foot, then $t_{pd} = 0.25n$ ns. Capacitance will cause an
extra delay of about 3 ns/foot, giving $t_{cd} = 0.75n$ ns, so that
$t_{ds} = t_{pd} + t_{cd} = n$ ns. The expression for $t_{c(bus)}$ becomes:

$$t_{c(bus)} = [n + (t_{brg} + t_{da} + t_{br})]\,N e \ \text{ns}.$$

Taking typical values $t_{brg} = 100$ ns, $t_{da} = 50$ ns, $t_{br} = 50$ ns, and
e = 1.1, the communication time per active element becomes:

$$t_{c(bus)}/N = (1.1n + 220) \ \text{ns}.$$
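
The bus model above can be written out as a small function. This is a sketch under the typical values quoted in the text; the function and parameter names are assumptions introduced here.

```python
def bus_time_per_element_ns(n, e=1.1, t_brg=100.0, t_da=50.0, t_br=50.0,
                            d_feet=0.5, prop_ns_per_foot=1.0, cap_ns_per_foot=3.0):
    """Bus communication time per active element, t_c(bus)/N, in nanoseconds."""
    avg_distance_feet = n * d_feet / 2                       # average path length (nd)/2
    t_pd = avg_distance_feet * prop_ns_per_foot              # 0.25 n ns
    t_cd = avg_distance_feet * cap_ns_per_foot               # 0.75 n ns
    t_ds = t_pd + t_cd                                       # n ns
    return (t_brg + t_ds + t_da + t_br) * e
```

For n = 90 this evaluates to 1.1(90) + 220 = 319 ns per active element.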

This expression as a function of the number of processors in the
multiprocessor simulator is plotted in Fig. 17a together with the
expression for evaluation time per active element for simple evaluations
and functional evaluations. Let $t_1$ be the length of the simulation
cycle for a single-processor simulator and $t_m$ be the length of the
simulation cycle for a multiprocessor simulator. The performance of
the multiprocessor simulator is defined as the ratio of single-processor
time to multiprocessor time: $t_1/t_m$. The ratio of $t_1$ to $t_m$ is plotted in
Fig. 17b for simple evaluations and in Fig. 17c for functional evaluations.

Fig. 17: Processing time and bus communication time considerations.

From Fig. 17b it can be seen that for simple evaluations the ratio of $t_1$
to $t_m$ is maximum for $t_{p(se)} = t_{c(bus)}$. For this condition, the
multiprocessor simulator gives its best speed performance over the
single-processor simulator. The value of n is about 90 for this condition. Adding
more processors will not speed up the simulator because the communication
time will be a bottleneck.
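
The value n of about 90 follows from setting $t_p/N = ka/n$ equal to $t_{c(bus)}/N = e(n + 200)$ ns and solving the resulting quadratic in n. A minimal sketch, assuming the typical values used in the text (the function name and argument names are illustrative):

```python
import math


def bus_crossover_n(a_us, k=1.1, e=1.1, t_fixed_ns=200.0):
    """Number of processors at which t_p equals t_c(bus).

    Solves  k*a/n = e*(n + t_fixed)  for n, with all times in nanoseconds,
    i.e.  e*n**2 + e*t_fixed*n - k*a = 0, and returns the positive root.
    t_fixed_ns is t_brg + t_da + t_br (200 ns with the typical values).
    """
    ka_ns = k * a_us * 1000.0
    b = e * t_fixed_ns
    return (-b + math.sqrt(b * b + 4.0 * e * ka_ns)) / (2.0 * e)
```

With a = 26 μs the root is close to 90, and with a = 1040 μs it is close to 925, the value quoted for functional evaluations in Section 6.5.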

Note also from Fig. 17b that for $t_c < t_p$ the processing time is a
bottleneck and processors can be added (as long as $n \ll N$) to speed
up the simulation. For $t_c > t_p$, the communication time is a
bottleneck, and a faster communication structure is needed to
speed up the simulation. In the best case $t_c = t_p$, and for further
speed improvements both faster processors (or more processors if
$n \ll N$) and a faster communication structure must be added.

It is worth noting that for smaller values of n the communication
activity will be slightly lower than predicted, since the probability that
two adjacent element strings will be in the same processor is higher for
smaller values of n. An element string is as defined in the section on
circuit partitioning (Section 6.1), and two adjacent element strings are
two strings with common interconnections. This is not expected to have
any significant impact on the above analysis.

Figure 17c shows that for functional evaluations the processing time
will be a bottleneck and the time-shared parallel bus provides the
required communication speed as long as $n \ll N$.

A problem that may be encountered in the parallel bus structure is 
data skewing. The greater the number of lines on the communication 
bus, the greater the effect of data skew. Line conditioning in the form 
of bus extenders might be required for proper operation. Also, for large 
n, a hierarchical bus structure will be required for suitable operation. 

6.4.2 Cross-point matrix 

In a cross-point based communication structure, several processors
can simultaneously send data to other processors. The total communication
time will be governed by the processor having the maximum
data to be sent over the communication structure, i.e., (N/n)(k)(e)
communication events. The time required to transmit one event will
be the sum of the channel request and grant time ($t_{crg}$), the delay incurred
in transmission of the message through the matrix ($t_{dm}$), and the channel
release time ($t_{cr}$). Also, when a processor asks to use the matrix, the
destination processor might be busy. This requires selecting another event
for transmission and trying to resend the blocked event at a later time.
Let j be the number of events for which the channel is found busy and
$t_{rb}$ be the time wasted in processing a blocked request. An expression
for the total communication time is:

$$t_{c(matrix)} = [t_{crg} + t_{dm} + t_{cr}]\,(N/n)\,k\,e + t_{rb}\,j.$$

Taking typical values $t_{crg} = 50$ ns, $t_{dm} = 100$ ns, $t_{cr} = 50$ ns,
k = 1.1, $t_{rb} = 50$ ns, e = 1.1, and j = 0.1(N/n) (the channel is found
busy for 10 percent of the transfer requests), the communication time per
active element becomes:

$$t_{c(matrix)}/N = 247/n \ \text{ns}.$$

Comparing this value with $t_{p(se)}/N = 28.6/n$ μs and $t_{p(fe)}/N = 1144/n$
μs, it can be seen that the matrix will never cause a bottleneck in the
multiprocessor simulator.

It is interesting to note that since the matrix provides a high-speed
communication structure, some inefficiency in the partitioning algorithm
can be tolerated. For c = 1 (smallest possible average chain) and
a worst-case average fanout of 5, e will be equal to 5 and
$t_{c(matrix)}/N = 1.105/n$ μs. This value is still much less than the
processing time for simple evaluations.
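
Both matrix figures (247/n ns for the typical case and 1.105/n μs for the worst case) can be reproduced with a short function. This is a sketch; the names, and the choice to express j as a fraction of N/n, are assumptions that follow the typical values given in the text.

```python
def matrix_time_per_element_ns(n, e=1.1, k=1.1, t_crg=50.0, t_dm=100.0,
                               t_cr=50.0, t_rb=50.0, blocked_fraction=0.1):
    """Matrix communication time per active element, t_c(matrix)/N, in ns.

    blocked_fraction models j = blocked_fraction * (N/n), as in the text.
    """
    per_event_ns = t_crg + t_dm + t_cr               # 200 ns to send one event
    busiest_processor = per_event_ns * k * e         # times N/n events
    blocked_requests = t_rb * blocked_fraction       # times N/n events
    return (busiest_processor + blocked_requests) / n
```

`matrix_time_per_element_ns(1)` gives about 247 ns and `matrix_time_per_element_ns(1, e=5)` gives about 1105 ns, to be compared with 28,600 ns for simple evaluations on one processor.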

Figure 18 shows the various processing and communication times.
Once again, the above analysis has been done for $n \ll N$. For n comparable
to or greater than N, the matrix communication time will remain constant.

6.5 Comparisons 

Estimated values for $t_c$ and $t_p$ will now be derived. For realistic
situations $1 \ll n \ll N$. For a circuit with 100,000 elements, assuming
2-percent activity per time frame, N will be 2,000.

For simple evaluations and a parallel bus, the maximum useful value of n
is 90. For n greater than 90, the performance goes down since the bus
becomes a bottleneck. For n = 90 the length of the simulation cycle per
active element is $t_{p(se)}/N = 28.6/n = 0.32$ μs. The number of evaluations
per second becomes 3,125,000. This represents an increase by a factor of
100 over a traditional logic simulator (30,000 evaluations per second). The
growth of the multiprocessor with a parallel bus is restricted in the sense
that this is the optimum performance and increasing the number of processors
will not yield any further improvement. For modularity and greater
performance, a matrix-based communication structure is required for
simple evaluations. With a cross-point matrix, the performance can be
increased by increasing n as long as n remains much less than the
activity N. With n = 256 and a cross-point matrix, the number of
simple evaluations per second becomes 9,000,000. This represents an
increase by a factor of 300 over the traditional logic simulator. For n
= 512 and a cross-point matrix, the number of simple evaluations per
second becomes 18,000,000. This represents an increase by a factor of
600 over a traditional logic simulator.
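
These throughput figures can be retraced with the sketch below, which assumes the simulator is processing-bound so that the time per evaluation is ka/n. The function name is illustrative. The results differ slightly from the quoted numbers because the text rounds the cycle time (for example, 0.32 μs at n = 90) before inverting it.

```python
def evaluations_per_second(n, a_us=26.0, k=1.1):
    """Simple evaluations per second when processing time sets the cycle."""
    time_per_evaluation_us = k * a_us / n      # t_p/N = 28.6/n microseconds
    return 1e6 / time_per_evaluation_us


TRADITIONAL = 30_000   # evaluations per second on a general-purpose computer

for n in (90, 256, 512):
    rate = evaluations_per_second(n)
    print(n, round(rate), round(rate / TRADITIONAL))
```

The speedups come out near 105, 298, and 597, which the text quotes as 100, 300, and 600.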

The speedup for functional evaluations with a time-shared parallel
bus is of the same order. The maximum value of n in this case is 925.
For n = 90, the speedup factor is 100 and for n = 256 the speedup
factor is 300. However, smaller values of n will be used in practice since
the activity will be much lower for functional evaluations.

2896 THE BELL SYSTEM TECHNICAL JOURNAL, DECEMBER 1982

[Figure 18: time per active element versus n, the number of processors, with
curves for $t_{1(fe)}/N = 1040$, $t_{1(se)}/N = 26$, $t_{p(fe)}/N$, $t_{c(bus)}/N$,
$t_{p(se)}/N$, and $t_{c(matrix)}/N$.]

Fig. 18: Comparison of various communication times and processing times.

If slightly lower performance can be tolerated, then a nonmicroprogrammable
microprocessor can be used. For the Intel 8086, the element-evaluation
time becomes a = 181.5 μs. For a parallel bus, the
value of n for optimum performance is 320. The number of simple
evaluations per second becomes 1,600,000, an increase in performance
by a factor of 53 over a traditional logic simulator. Once again, however,
this is the best performance that can be obtained using the parallel bus.
Using a cross-point matrix with n = 512, the speedup over the
traditional logic simulator is 90.

VII. ARCHITECTURE CHOICE 

It is seen from the previous section that a cross-point matrix is 
preferable for simulating a circuit containing simple elements. A time- 
shared parallel bus can be used for simple evaluations, but a speed 
penalty will be incurred and modularity will be lost. A time-shared 
parallel bus is sufficient for functional evaluations. A combination of 
the cross-point matrix and parallel bus can be used to simulate circuits 
containing both simple and functional elements. 

For both simple evaluations and functional evaluations, a commu- 
nication structure consisting of both a cross-point matrix and a time- 
shared parallel bus will prove cost effective. For transferring data 
between the cross-point matrix and the parallel bus a Bus Interface 
Unit (BIU) is required. The configuration of the BIU is shown in Fig. 
19. 

Data sequencer 1 transfers data from functional evaluators connected
to the parallel bus to the simple evaluators and the master
connected to the cross-point matrix. The data sequencer receives
signal information and data from the parallel bus and transforms it for
suitable transmission over the cross-point matrix. The parallel data
received from the bus is transferred serially over the matrix.

Data sequencer 2 sends data from the cross-point matrix to the
parallel bus. The serial data received from the cross-point matrix is
converted to parallel data for suitable transmission over the parallel
bus.

Fig. 19: Interface between parallel bus and cross-point matrix.

VIII. PERFORMANCE ANALYSIS 

The effect of varying the model parameters on the performance of the
multiprocessor is considered in this section. The analysis so far has
been done using average values. The performance of the multiprocessor
will be affected to some extent by variations in these parameters.
The processing times $t_{p(se)}$ and $t_{p(fe)}$ are functions of N, n, k, and a.
The communication time for the bus, $t_{c(bus)}$, is a function of N, n, f, and c.
The communication time for the matrix, $t_{c(matrix)}$, is a function of
N, n, k, f, and c. The effect of variations in some of these parameters on
the performance of the logic simulator will be investigated.

8.1 Effect of changes in circuit activity N

8.1.1 Bus architecture 

8.1.1.1 Simple evaluations. As seen from Fig. 20a, if activity N
increases then $t_{p(se)}$ and $t_{c(bus)}$ increase proportionately. The operating
point shifts from (1) to (2) for increasing values of N. This means that 
the length of the simulation cycle for the proposed simulator will 
increase by the same percentage that N increases. In the case of a 
single processor simulator, the simulation cycle will also increase by 
the same percentage. The performance index of the multiprocessor 
simulator (in terms of single processor time to multiprocessor time 
ratio) is not affected by the circuit activity, N (Fig. 20b). 

8.1.1.2 Functional evaluations. The processing time of the multiprocessor
simulator for functional evaluations [$t_{p(fe)}$] is larger than the
processing time for simple evaluations [$t_{p(se)}$] by a slowdown factor of
between 30 and 50. The analysis for simple evaluations done above,
therefore, also applies to functional evaluations (i.e., the performance
index of the multiprocessor simulator is not affected by circuit activity).

8.1.2 Matrix architecture (simple evaluations)

Since the processing time [$t_{p(se)}$] and matrix communication time
[$t_{c(matrix)}$] are both proportional to N, they will increase by equal
proportions. The length of the simulation cycle is governed by $t_{p(se)}$
and the performance index (single processor to multiprocessor time) is
given by $t_1/t_m = (Na)/[(N/n)ka] = n/k$. Once again this value is
independent of N and the performance index of a multiprocessor
simulator with matrix architecture is not affected by circuit activity.

Fig. 20: Effect of changes in circuit activity. (a) Processing time (simple evaluations)
and bus communication time for different values of N. (b) Performance variation for (a).

8.2 Effect of partitioning: variations in k 

8.2.1 Bus architecture 

8.2.1.1 Simple evaluations. The factor k represents the increase in
activity in a processor, over the average N/n, due to nonideal partitioning.
An increase or decrease in k changes $t_{p(se)}$ in proportion but
does not affect $t_{c(bus)}$. As seen from Fig. 21a for k = 1.1, the operating
point is at (1) with n = 90. If k increases then the operating point 
moves to (2). The length of the simulation cycle increases in proportion 
to the increase in k. Since the length of the simulation cycle for the 
single-processor simulator is not affected by k, the multiprocessor 
performance index goes down. For n = 90, the performance index of 
the multiprocessor goes down from 82 to 64 as k increases from 1.1 to 
1.4. 

If the multiprocessor simulator is designed for a higher value of k,
then the number of processors required for maximizing the performance
index is greater than 90 and the maximum performance index
decreases as k increases. This is shown by points (1) and (3) in Fig.
21b. For a multiprocessor simulator designed for k = 1.1, the maximum
performance index is 82 for n = 90. If the multiprocessor is designed
for k = 1.4, the maximum performance index is 77.
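
The performance indices quoted for the bus architecture can be reproduced from the per-element times of Section 6. The sketch below is illustrative only: the function name is an assumption, the single-processor time per element is taken as a, and the multiprocessor time per element as the larger of $ka/n$ and $e(n + 200)$ ns.

```python
def performance_index(n, k, a_us=26.0, e=1.1, t_fixed_ns=200.0):
    """Ratio t_1/t_m for a bus-based simulator (simple evaluations by default)."""
    t1_ns = a_us * 1000.0                     # single processor, per active element
    t_p_ns = k * a_us * 1000.0 / n            # processing time per active element
    t_c_ns = e * (n + t_fixed_ns)             # bus time per active element
    return t1_ns / max(t_p_ns, t_c_ns)


# Index at n = 90 for k = 1.1 and k = 1.4, then the best n when designing for k = 1.4.
index_k11 = performance_index(90, 1.1)
index_k14 = performance_index(90, 1.4)
best_n_k14 = max(range(1, 301), key=lambda n: performance_index(n, 1.4))
```

Under these assumptions the index at n = 90 falls from about 82 to about 64, and a design for k = 1.4 peaks at an index of about 77 with somewhat more than 90 processors.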

8.2.1.2 Functional evaluations. For functional evaluations, the curves
for $t_{p(fe)}$ and $t_{c(bus)}$ meet for a large value of n (= 925). The length of the
simulation cycle will be governed by the processing time. Figure 21c
shows that for n = 100, the operating point will move from (1) towards
(2) as k increases from 1.1 to 1.4. Thus, the length of the simulation
cycle will increase. The increase will not affect the single-processor
simulator since its simulation cycle is not affected by k. Thus, the
overall performance index will decrease as the per-processor activity
(k) increases. The performance index decreases from 91 to 71 as k
increases from 1.1 to 1.4 for n = 100 (see Fig. 21d).

If the multiprocessor is designed for a higher value of k, then the 
number of processors required to maintain the performance index 
constant goes up. For example, for k = 1.1 and n = 100 the performance 
index is 91. If the simulator is to be designed for k = 1.4, then 127 
processors are required to maintain the performance index at 91. 

8.2.2 Matrix architecture (simple evaluations) 

Once again, the processing time [$t_{p(se)}$] and matrix communication
time [$t_{c(matrix)}$] are proportional to k. The processing time will determine
the length of the simulation cycle regardless of the value of k. This 
case is similar to that in the previous section for functional evaluations 
and a parallel bus (Section 8.2.1). The overall performance index will 
go down as the activity increases. Also if the multiprocessor is designed 
for a higher value of k then the number of processors required to 
maintain the performance index constant goes up. 

8.3 Effect of partitioning: variations in c 
8.3.1 Bus architecture 

8.3.1.1 Simple evaluations. Variations in c, the average number of
elements per element string, will affect the bus communication time
$t_{c(bus)}$. A decrease in c will increase the bus communication time. Figure
22a shows that the operating point moves from (1) to (2) as the value
of c decreases from 10 to 1. For a constant number of processors and
a decreasing c, the communication time may become a bottleneck. The
length of the simulation cycle will increase and the performance index
will go down.

If the simulator is designed for smaller values of c, the number of 
processors required will be smaller but the maximum performance 
goes down [operating point (3) in Fig. 22b]. 

8.3.1.2 Functional evaluations. For functional evaluations, the processing
time is a bottleneck. The worst-case value of c is one (i.e., chains
of length one). Even for this case the curves for $t_{p(fe)}$ and $t_{c(bus)}$ meet for
a large value of n (= 663). Thus, the simulator for functional evaluations
is not affected by c (typical value of n is 100).
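
The crossover value n = 663 can be checked by recomputing e for c = 1 and solving $ka/n = e(n + 200)$ again. A minimal sketch with assumed names and the typical values from Section 6:

```python
import math


def crossover_n(a_us, f, c, k=1.1, t_fixed_ns=200.0):
    """Processor count at which t_p equals t_c(bus) for fanout f and string length c."""
    e = (f + (c - 1) * (f - 1)) / c           # communication events per active element
    b = e * t_fixed_ns
    ka_ns = k * a_us * 1000.0
    # positive root of e*n**2 + e*t_fixed*n - k*a = 0 (times in ns)
    return (-b + math.sqrt(b * b + 4.0 * e * ka_ns)) / (2.0 * e)
```

For functional evaluations (a = 1040 μs) and f = 2, shortening the strings from c = 10 to c = 1 moves the crossover from about 925 to about 663 processors, still far above the typical n of 100.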

8.3.2 Matrix architecture (simple evaluations) 

As noted at the end of Section 6.4.2, the processing time $t_{p(se)}$ will still be
a bottleneck for the worst-case value of c = 1. Thus, the performance of
the simulator with a matrix is not affected by variations in c.

Fig. 22: Effect of partitioning: variations in c. (a) Processing time (simple evaluations)
and bus-communication time for different values of c. (b) Performance variation as
affected by changes in c for (a).

8.4 Effect of variations in fanout, f 
8.4.1 Bus architecture 

8.4.1.1 Simple evaluations. An increase in f will increase the bus
communication time $t_{c(bus)}$. The processing time will not be affected by
changes in f. Figure 23a shows that the operating point moves from (1)
to (2) as fanout increases. Running circuits with larger fanout on a
simulator designed to operate with an average fanout of 2 will cause
the communication time to become a bottleneck. Since the processing
time does not change, the performance index of the simulator will
decrease. Similarly, if a simulator is designed for larger fanout, its
maximum performance index will be less than that of a simulator
designed for a smaller fanout (see Fig. 23b).

Fig. 23: Effect of variations in fanout. (a) Processing time (simple evaluations) and
bus-communication time for different values of f. (b) Performance variation as affected
by changes in f for (a). (c) Processing time (functional evaluations) and
bus-communication time for different values of f. (d) Performance variation as
affected by changes in f for (c).
8.4.1.2 Functional evaluations. The simulator for functional evaluations
is sensitive to changes in f. As seen from Fig. 23c, for n = 100 and
f > 39 the communication time becomes a bottleneck and the performance
index of the simulator goes down. The maximum performance
index of the simulator is also lower for higher fanout. For large fanout,
the communication time becomes a bottleneck for functional evaluations
and a matrix may have to be used.
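
The threshold f > 39 can be recovered by finding the value of e at which the bus time equals the processing time and then inverting the expression for e. The sketch below assumes c = 10 and the typical timing values from Section 6; the function name is illustrative.

```python
def max_fanout_before_bus_bottleneck(n, a_us, c=10, k=1.1, t_fixed_ns=200.0):
    """Largest average fanout for which the bus is not yet the bottleneck."""
    t_p_ns = k * a_us * 1000.0 / n            # processing time per active element
    e_max = t_p_ns / (n + t_fixed_ns)         # from t_c(bus)/N = e*(n + t_fixed)
    # invert e = [f + (c - 1)(f - 1)]/c  to get  f = (c*e + c - 1)/c
    return (c * e_max + c - 1) / c
```

For n = 100 and a = 1040 μs the result is just over 39.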

8.4.2 Matrix architecture 

8.4.2.1 Simple evaluations. For an average fanout of 2, the processing
time is a bottleneck. As the fanout increases, the curves for $t_{p(se)}$ and
$t_{c(matrix)}$ approach each other. The communication time becomes a
bottleneck for f > 128. This is an unrealistically large value and will
not occur in practice.

8.4.2.2 Functional evaluations. The effect on functional evaluations
is the same as discussed above, but the communication time becomes a
bottleneck only for f > 5,100.

IX. SUMMARY 

The architecture of a multiprocessor simulator has been presented. 
The performance of the simulator is expected to be better by more
than two orders of magnitude than that of traditional simulation
methods implemented on general-purpose computers. The power of
the simulator can be increased over a certain range by increasing the 
number of slaves. Also the cost of the CPU time should be much lower 
than that obtained from general-purpose computers. 

The architecture presented in this paper is expected to be faster 
than those of Barto and Szygenda, and Abramovici et al. The York- 
town Simulation Engine (YSE) built by IBM is reported to be faster 
than the architecture presented here, but the cost of the machine 
would be substantially higher since it uses special-purpose hardware. 
The architecture presented here and the YSE both try to take advan- 
tage of logic circuit concurrency to improve simulation performance. 
Unlike the YSE, our architecture implements event-driven simulation 
and is applicable to simulation with arbitrary delays at both gate and 
functional levels. Further work to be done includes detailed comparison 
of the various architectures in terms of performance and cost. 

The application of the multiprocessor simulator to fault simulation 
is being investigated at the present time. 
APPENDIX A

Partitioning Algorithm 

A way to partition the logic circuit is to first generate and assign 
element strings to blocks, where an element string is as defined in 
Section 6.1. The next step is to balance the load (number of elements) 
in the blocks such that all blocks have approximately the same number 
of elements. Note that the elements have to be ordered according to 
levels prior to partitioning of the logic circuit. The partitioning algo- 
rithm is detailed below: 

A.1 Generate and assign strings

1. i = 

[Note: $a_i$ is the block to which assignments are currently being
made.]

2. Select an unmarked element whose fanins have all been assigned
previously to blocks other than $a_i$. Call the selected element $e_s$. If
there is no such element and there are still some unmarked
elements, go to step 6. If all elements are marked, go to algorithm
A.2 (Load Balancing).

[Note: Primary inputs can be treated as elements whose fanin has
been previously assigned to blocks other than $a_i$.]

3. Assign the selected element $e_s$ to block $a_i$ and mark $e_s$.

4. If any fanout $e_k$ of $e_s$ is in $a_i$, go to step 6.

[Note: If a fanout element has been previously assigned to the
current block $a_i$, then assigning another fanout element to the
current block will require sequential processing of two elements.]

5. If there is an unmarked fanout $e_k$ of $e_s$ such that no fanin of $e_k$
(except $e_s$) is in $a_i$,

then: (a) s = k
(b) Go to step 3.

[Note: If all of the fanout elements have been assigned to some
other blocks, then the string cannot be extended. Note that
primary outputs can be treated as being assigned to a null
block.]

6. (i) i = (i + 1) (modulo n)
(ii) Go to step 2.
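
The steps above can be sketched in Python as follows. This is one possible reading, not the authors' implementation. Several points are assumptions made here: the circuit is given as a dictionary mapping each element to its fanout elements, listed in level order and free of feedback loops; the starting block index is taken as 0 because the initial value in step 1 is not legible in the source; and a safeguard, which is not part of the published algorithm, places an element in the current block when no block satisfies step 2.

```python
def assign_strings(fanouts, n):
    """Assign elements to n blocks by growing element strings (Section A.1).

    fanouts : dict mapping element -> list of fanout elements, in level order
    returns : dict mapping element -> block index
    """
    fanins = {e: [] for e in fanouts}
    for src, outs in fanouts.items():
        for dst in outs:
            fanins[dst].append(src)

    block = {}      # marked elements and the block each was assigned to
    i = 0           # assumed starting block
    misses = 0      # consecutive blocks for which step 2 found nothing
    while len(block) < len(fanouts):
        # Step 2: unmarked elements whose fanins have all been assigned ...
        ready = [e for e in fanouts
                 if e not in block and all(p in block for p in fanins[e])]
        # ... and, of those, one whose fanins all lie outside block i.
        pick = next((e for e in ready
                     if all(block[p] != i for p in fanins[e])), None)
        if pick is None and misses >= n:
            pick = ready[0]         # safeguard, not in the paper
        if pick is None:
            misses += 1
            i = (i + 1) % n         # step 6
            continue
        misses = 0
        s = pick
        while True:
            block[s] = i            # step 3: assign and mark
            outs = fanouts[s]
            # Step 4: a fanout already in this block ends the string.
            if any(block.get(k) == i for k in outs):
                break
            # Step 5: extend along an unmarked fanout with no other fanin in block i.
            nxt = next((k for k in outs if k not in block and
                        all(block.get(p) != i for p in fanins[k] if p != s)),
                       None)
            if nxt is None:
                break
            s = nxt
        i = (i + 1) % n             # step 6
    return block


# Hypothetical four-element circuit: a and b feed c, and c feeds d.
example = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
assignment = assign_strings(example, 2)
```

In this small example the string (a, c, d) lands in block 0 and b in block 1.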

A.2 Balance load 

1. Total the number of elements in each block.

2. $a_{max}$ = block with most elements
$a_{min}$ = block with fewest elements
$e_{max}$ = number of elements in $a_{max}$
$e_{min}$ = number of elements in $a_{min}$

3. If ($e_{max} - e_{min}$) < 0.1[(total number of elements)/n], stop.
[Note: If the maximum unbalance is less than 10%, stop.]

4. Select string $s_i$ in $a_{max}$ such that length($s_i$) < $e_{max} - e_{min}$.

5. Move string $s_i$ to block $a_{min}$.

6. If ($e_{max} - e_{min}$) < 0.1[(total number of elements)/n], go to step 4.
Else: Go to step 2.

Fig. 24: Partitioning example. (a) Logic circuit to be partitioned. (b) String
assignments before load balancing. (c) String assignments after load balancing.
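
A simplified sketch of the load-balancing phase is given below. It departs from the listed steps in two ways, both assumptions made for illustration: the largest and smallest blocks are re-selected after every move rather than following the branch in step 6, and the loop also stops when no string in the largest block is short enough to move. Blocks are represented as lists of strings, and each string as a list of elements.

```python
def block_size(block):
    """Number of elements in a block (a list of strings)."""
    return sum(len(string) for string in block)


def balance_load(blocks, tolerance=0.1):
    """Move strings from the fullest block to the emptiest one, in place."""
    n = len(blocks)
    total = sum(block_size(b) for b in blocks)
    threshold = tolerance * total / n            # 10% of the average block size
    while True:
        a_max = max(blocks, key=block_size)
        a_min = min(blocks, key=block_size)
        gap = block_size(a_max) - block_size(a_min)
        if gap < threshold:                      # step 3: balanced enough
            return blocks
        # Step 4: strings short enough that moving one narrows the gap.
        movable = [s for s in a_max if 0 < len(s) < gap]
        if not movable:                          # safeguard, not in the paper
            return blocks
        chosen = max(movable, key=len)
        a_max.remove(chosen)                     # step 5
        a_min.append(chosen)


# Hypothetical assignment: three blocks holding 4, 1, and 2 elements.
blocks = [[[1, 2, 3], [4]], [[5]], [[6, 7]]]
balance_load(blocks)
sizes = [block_size(b) for b in blocks]
```

Here the one-element string moves from the first block to the second, leaving block sizes of 3, 2, and 2.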

As an example, consider the circuit in Fig. 24 to be partitioned into 
three blocks. The blocks and strings are then assigned as shown while 
looping through steps 2 through 6 in Section A.1. Note that element 4
is not assigned to block A since the fanout element 5 is already in block
A (step 4 in Section A.1). At the end of load balancing, element 4 will be
assigned to block C.

Note that zero delay elements are not considered in the algorithm. 
Any zero delay element and all of its fanouts must be in the same block 
of the partition. 
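
The zero-delay constraint can be checked after partitioning with a short Python sketch such as the one below. The function name `zero_delay_violations` and its arguments are hypothetical, and fanouts without a block assignment (for example, primary outputs treated as belonging to a null block) are assumed to be exempt.

```python
def zero_delay_violations(block_of, fanout, zero_delay):
    """List (element, fanout) pairs that break the zero-delay rule.

    block_of maps element -> block index, fanout maps element -> list of
    driven elements, and zero_delay is an iterable of zero-delay elements.
    """
    return [
        (e, k)
        for e in zero_delay
        for k in fanout.get(e, [])
        # ASSUMPTION: fanouts with no block (null block) are not checked.
        if k in block_of and block_of[k] != block_of[e]
    ]
```

An empty result means every zero-delay element shares a block with all of its assigned fanouts.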
APPENDIX B
Glossary 

BIU — Bus Interface Unit.
CPU — Central Processing Unit (part of the master unit).
EOD — End of Data (flag).
IDS — Input Data Sequencer.
IFB — Input FIFO Buffer.
L_t — Current list of timing wheel.
N — Average number of active elements per simulation cycle.
ODS — Output Data Sequencer.
OFB — Output FIFO Buffer.
PU — Processing Unit (part of a slave unit).
RTS — Request to Send line.
RTSM — Request to Send to Master.

a — Time required to process one active element.
a_i — A block of a partitioned circuit.
b_ij — Signal connections between blocks a_i and a_j.
c — Average number of elements in an element string.
C_i — Subcircuit located in processor p_i.
d_ij — Data path between processors p_i and p_j.
f — Average fanout of an element.
k — Imbalance factor owing to non-ideal partitioning.
n — Number of processors in the multiprocessor simulator.
p_i — Processor in multiprocessor simulator.
t_1 — Length of simulation cycle for a single processor simulator.
t_1(fe) — Processing time during one simulation cycle with single
    processor (functional evaluations).
t_1(se) — Processing time during one simulation cycle with single
    processor (simple evaluations).
t_br — Bus release time for a parallel bus structure.
t_brg — Bus request and grant time for a parallel bus structure.
t_c — Total communication time during one simulation cycle.
t_c(bus) — Total communication time during one simulation cycle for a
    parallel bus structure.
t_c(matrix) — Total communication time during one simulation cycle for a
    matrix structure.
t_d — Capacitance delay portion of address and data setup time, t_ds.
t_cr — Channel release time for a matrix structure.
t_crg — Channel request and grant time for a matrix structure.
t_da — Data acknowledge time for a parallel bus structure.
t_dm — Delay incurred in transmission of message through matrix.
t_ds — Address and data setup time for a parallel bus structure.
t_m — Length of simulation cycle for a multiprocessor simulator.
t_p — Average processing time per processor during a simulation
    cycle, where average processing time consists of the time
    required to process all active elements and schedule resulting
    events.
t_p(fe) — Average processing time per processor during one simulation
    cycle for functional evaluations.
t_p(se) — Average processing time per processor during one simulation
    cycle for simple evaluations.
t_pd — Propagation delay portion (without capacitance) of t_ds.